Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#360 Bwt #411

Merged
merged 62 commits into from
Jan 18, 2024
Merged

#360 Bwt #411

merged 62 commits into from
Jan 18, 2024

Conversation

TwFlem
Copy link
Contributor

@TwFlem TwFlem commented Dec 6, 2023

Changes in this PR

This is for #360. @TimothyStiles mentioned implementing a naive version of the BWT. At first I took his suggestion. As I read up on the applications of BWT for genomic data, the more it seemed important that we have as small as a memory footprint as possible. I then began to optimize Some things up front. And well... I ultimately followed Tim's suggestion about keeping it naive so we have something to start with :)

Also, this does not include the fuzzy matching that @Koeng101. LMK if we want that in here, but that might be appropriate as its own thing since there's already a fair amount going on in this PR.

There are a good chunk of auxiliary pieces to the BWT. I will talk about what was used in this PR along with some suggested ways we move forward to improve things. I will also point out the especially naive parts that most definitely need to be addressed.

The BWT and other data structures added to enable alignment are documented in comments.

Things to Consider for the future:

  1. Move towards using an RLFM and or r-index for significant space savings.
  2. See if the Wavelet Tree would be faster if it were just a contiguous piece of memory
  3. suffix array sampling- especially if we do 1. in the future
  4. Better select implementation for the RSA Bitvector

Why are you making these changes?

Tim hinted at a need for a way to compress sequences while still being able to analyze them in #360

Are any changes breaking? (IMPORTANT)

This is all net new!

Pre-merge checklist

  • New packages/exported functions have docstrings.
  • New/changed functionality is thoroughly tested.
  • New/changed functionality has a function giving an example of its usage in the associated test file. See primers/primers_test.go for what this might look like.
  • Changes are documented in CHANGELOG.md in the [Unreleased] section.
  • All code is properly formatted and linted.
  • The PR template is filled out.

@Koeng101
Copy link
Contributor

Koeng101 commented Dec 6, 2023

Very cool! Tell me when you need a code review

Agree that the first version should be naive. Can improve from there!

@TimothyStiles
Copy link
Collaborator

Hey @TwFlem thanks for submitting this draft!

For naive I meant really, really naive. I'm not sure what everything here is doing yet but here's a simple 30 line BWT implementation that works:

func burrowsWheelerTransform(input string) string {
	// Add a sentinel character to the end of the input
	input += "$"

	// Create a matrix to store rotations
	rotations := make([]string, len(input))
	for i := 0; i < len(input); i++ {
		rotations[i] = input[i:] + input[:i]
	}

	// Sort the rotations lexicographically
	sort.Strings(rotations)

	// Extract the last characters of each rotation to get the transformed string
	var result string
	for _, rotation := range rotations {
		result += string(rotation[len(rotation)-1])
	}

	return result
}

Obviously yours is more complex and very likely faster but I'm not sure what's going on yet. What's the best way to compare this^^ toy version with your own in this PR? What references did you use to create your draft? Any links?

-Tim

@TwFlem
Copy link
Contributor Author

TwFlem commented Dec 7, 2023

Sure! So I read through many materials, but the most salient of them all came from Ben Langmead. He has a whole youtube playlist about the subject.

I wish it wasn't so complex! So the BWT by itself is actually a small piece of the puzzle- there are auxiliary components that go along with the BWT to make it possible to both compress the size from the original string while also making it usable as a search index to perform other kinds of analysis like sequence alignment.

The equivalent to what you wrote above is here in the pr (ignoring the skiplist and the part about wavelet tree). The only difference with what I have here is that it saves some memory because we don't have to store NxN matrix of the string before we sort it. The getBWTIndex is what allows us to get that original character from the provided sequence that would show up in the L column for that given rotation.

Everything else in this PR is what is necessary to enable search and alignment.

To sum up what I have in the PR and why it is there at a high level:

  1. Bitvector- gives us a way to encode information in a memory efficient way. For example, instead of needing 1 byte to represent a character, all we need it 1 bit in the bitvector.
  2. RSABitvector- RSA stands for Rank, Select, Access. So far, only Rank and Access are needed in the current implementation. However, select is needed for the RLFM/r-index implementations which would be the most usable implementation for anyone trying to do any serious work with the BWT. More on RLFM and r-index later though.
  3. Wavelet tree- This is a data structure that allows us to make RSA queries on strings. TL;DR it is a binary tree that allows us to sub divide a given string and corresponding alphabet so we can make RSA queries. This is probably the most complex thing. I suggest just giving this a quick once over.
  4. LF mapping: LF stands for Last and First: Last column of the BWT transform and First column of the BWT transform. In this PR, F is the skiplist and L is just l for now. This is actually how we conduct search and alignment. In the Count, Locate, and Extract queries, you'll see we start with the whole range of both coulmns as the search range and use the skiplist along with ranking the L column to narrow down that search range. I suggest watching this to see how the backwards search is conducted. It's a lot easier to see what's going on in the visual.
  5. prefixArray: This is a naive implementation. This takes up way too much memory as it currently is. The only reason why I didn't improve it further is because we'd most likely have to completely change its implementation if we want to do the RLFM and or r-index. The prefix array gives us a look up table that allows us map characterInTheFCoulmn -> characterInTheOriginalString. This allows us to extract the text.

Also, I mentioned the RLFM and r-index. The only reason why I didn't continue with doing these is because this has gotten complex enough... that and I figured it'd be good to make sure you want all of this in the first place. With that being said, I highly recommend we pursue them. In a very similar manner to how the skipList in this PR is constructed, we can compress the L column substantially- especially for sequences that are very repetitive. He has some very compelling numbers in this video towards the end. Also, once you see the compression, it makes sense how much of an improvement this could be.

Btw, I will fix the names and do some refactoring before taking this out of draft :).

Copy link
Collaborator

@TimothyStiles TimothyStiles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's late and I've realized that the linter has probably caught the typos that I did but way faster 😅

I like where this PR is heading . I don't understand everything yet and I definitely need to check out Ben Langmead's videos.

It'd be cool to do a little talk and demo for this. It's a great example for teaching and maybe we should share it back with Ben? Just some late night thoughts.

bwt/bwt.go Outdated

const nullChar = "0"

// BWT Burrow Wheeler Transform
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this opening explanation. Makes it very clear why this is necessary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've probably put this somewhere but I think the naive implementation as pseudocode may be good to explain the very simple core of the algo and help demystify it.

The stuff from Langmead should also be cited somewhere in the code itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure! I'll do something similar with what I did with the wavelet tree to illustrate things like:

  • the BWT itself
  • how the first and last column have characters of the same rank in the same order that enables all the magic
  • LF mapping
  • Compressing "runny" sequences like the first column.

I'd be happy to cite him. Are there any examples of citation that exist in Poly? If not, is there a preferred way to do that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that this is a great explanation!

I think that we don't have a standard way of citing things, we generally just provide a link to the original source. Might be worth compiling a list of properly formatted citations in the README, especially if we are planning on publishing.

Copy link
Contributor Author

@TwFlem TwFlem Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good @carreter . For now, I Cited him at the end of the BWT's comment at the top.

bwt/bwt.go Outdated
"golang.org/x/exp/slices"
)

const nullChar = "0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the null char similar to the "$" I've seen appended to the input string in other implementations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly. It was "$" at first. I made it 0 for now because lexicographically, that's the character with the lowest value. This is just so if someone were to try and use the BWT with a string with spaces in it just to mess around and understand how it works, then it would work for them. Some follow up work that should be done:

  1. modify the sorting so if we encounter the nullChar we always make sure that is the least. Sort everything else lexographically. I can give this a shot in this PR if you want though. I didn't want to add more noise than necessary.
  2. Make that nullChar configurable. Neither "$" or "0" should show up in the protein, dna, or rna alphabets but... you never know how users might want to use this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be whatever we want now! Replied to your other comment.

bwt/bwt.go Outdated
return i
}
}
panic("Unable to find the corresponding orginal positiong for a character in the original sequence in the suffix array. This should not be possible and indicates a malformed BWT.")
Copy link
Collaborator

@TimothyStiles TimothyStiles Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a little meta but what are your thoughts on panics and when to use them? I almost always try to return an error and have it be handled upstream for fear of killing a user's long run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends. I'm usually very hesitant to put them in anywhere as well. However in these cases in the PR, the panics are analogous to having an index out of bounds error for an array. If something wrong is happening like this in the internals of any of these data structures, then something has gone horribly wrong and the program can't recover since the BWT is likely borked.

That or if whoever is using is upstream is using it incorrectly, they can't recover from that incorrectness. Like in the case of the panic on the out of bounds on Extract. Imagine having to handle errors from index out of bounds.

Copy link
Contributor Author

@TwFlem TwFlem Dec 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought about this some more... Also, sounds like the Go team has some opinions on this: https://go.dev/blog/defer-panic-and-recover... Also read some opinions in the Chat :)

It does seem weird to bubble up errors like this because it is an internal problem, but Poly is also the first open source project I've contributed to. In my day job it's easier to force certain expectations and constraints everywhere since it is a controlled environment.

The more I think about it, the more is seems appropriate to give the users of Poly the choice of how they want to handle potential errors from bugs and or inappropriate API usage.

Maybe the example of the json package they mentioned in the blog post would also be appropriate here? We could panic internally and recover at the API boundary. That or we could just bubble errors all the way up for safety if for some reason we ever wanted to reuse the auxiliary components in here so we don't have any hidden panics that bubble up in the future if things get moved around. Maybe a combination of the three 🤔

Happy to change it up in any way!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still happy to change it up. With that said, I think the panics in bitvector are appropriate.

Copy link
Contributor Author

@TwFlem TwFlem Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it up so callers of BWT will not get any unexpected panics. Changes in f5b459b. LMK What you think.

bwt/bwt.go Outdated
prefixArray[i] = sequence[len(sequence)-i-1:]
}

// TODO: at the time of writing, the nullChar is 0, this is to ensure correctness in most cases.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may be a good idea? Is this why I've been seeing "$" in other implementations?

Copy link
Contributor Author

@TwFlem TwFlem Dec 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll just give the custom sorting function a shot on the string to ensure that the nullChar is always the least no matter what so we can put back the $.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, LMK what you think. nullChar could be configurable as a functional option now if users need to have $ as a part of their alphabet for the index they are throwing in to the BWT for whatever reason.

expected int
}

func TestBWT_Count(t *testing.T) {
Copy link
Collaborator

@TimothyStiles TimothyStiles Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this case making a substring histogram?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually not sure what Count could be used for in the real world tbh.

I'm thinking that it could be a quick way to check for the existence of an exact match? That or for statistics like you mentioned?

Might be a good question for the community- is this useful functionality to expose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Koeng101 is the Count of sub-sequences useful to Expose?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is! There are instances where you are looking for multiple counts of sub-sequences (like restriction enzyme sites). Generally speaking, useful to have counts.

bwt/wavelet.go Outdated Show resolved Hide resolved
bwt/wavelet.go Outdated
// ## RSA Intuition
//
// From here you may be able to build some intuition as to how we can take RSA queries given
// a characters path encoding and which character we'd like to Rank, Select, and Access.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RSA is spelled out here! I was beginning to think I wouldn't find it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that. Yeah, I was thinking that was more appropriate to spell out in the RSABitvector and just telling readers to look there to learn more about RSA. Happy to make that redundant though so maintainers don't have to jump around to Grok all of these concepts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put Definitions of Acronyms towards the top of explanations like this. LMK if that works out.

bwt/wavelet.go Outdated Show resolved Hide resolved
bwt/wavelet.go Outdated Show resolved Hide resolved
bwt/wavelet.go Outdated Show resolved Hide resolved
@TwFlem
Copy link
Contributor Author

TwFlem commented Dec 9, 2023

It'd be cool to do a little talk and demo for this.

I'd be happy to do that. Right now, there's not too much to demo or talk about maybe. We can talk about saving memory, indexing strings, the magic of the BWT and how rank allows us to take advantage of that magic.

Maybe it'd be even more valuable to have a demo once we have the fuzzy matching that Koeng101 mentioned in #360 along with having the RLFM and r-index? At that point, the BWT seems like it would be feature complete enough to reach a critical mass of usefulness for Poly's users.

It's a great example for teaching and maybe we should share it back with Ben? Just some late night thoughts.

I'd definitely like to at least thank him! It's amazing that he is giving that knowledge to the community like that. Perhaps his channel deserves a spot in https://github.com/TimothyStiles/how-to-synbio. I'm happy to share anything else that he might appreciate as well!

TwFlem and others added 11 commits December 20, 2023 23:12
Typo

Co-authored-by: Willow Carretero Chavez <[email protected]>
doc correction

Co-authored-by: Willow Carretero Chavez <[email protected]>
typo

Co-authored-by: Willow Carretero Chavez <[email protected]>
doc clarity

Co-authored-by: Willow Carretero Chavez <[email protected]>
Link to additional explanation for Wavelet Trees.

Co-authored-by: Willow Carretero Chavez <[email protected]>
doc improvement

Co-authored-by: Willow Carretero Chavez <[email protected]>
Doc improvement.

Co-authored-by: Willow Carretero Chavez <[email protected]>
Doc Improvement

Co-authored-by: Willow Carretero Chavez <[email protected]>
Copy link
Collaborator

@TimothyStiles TimothyStiles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TwFlem this is looking really good!

Exposed test structs should be removed and example tests should be in their own separate example_test.go file in a separate bwt_test package.

It's not clear where a lot of the test cases are coming from and there should be more comments about where they're from, and what corner case their testing.

I'm with @carreter that it'd be good to wrap my head around the actual concepts but I'm also thinking we should find some outside folk more familiar with bwt to review this too if we can.

"testing"
)

type GetBitTestCase struct {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be exposed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It Shouldn't be. Definitions in _test aren't public.

package poly

import (
	"fmt"

	"github.com/bebop/poly/bwt"
)

func polyRoot() {
	thing := bwt.GetBitTestCase{}
	fmt.Println(thing)
}

^ That won't build

bwt/bwt_test.go Outdated
"golang.org/x/exp/slices"
)

func ExampleBWT_Count() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example tests should be in their own example_test.go file with the package name bwt_test so that package functions are called like bwt.Count to more closely emulate usage outside the package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// Output: 10
}

type BWTCountTestCase struct {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be exposed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be. Explained Above.

bwt/bwt_test.go Outdated
}
}

func ExampleBWT_Locate() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in example_test.go file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

bwt/bwt_test.go Outdated
func ExampleBWT_Extract() {
inputSequence := "AACCTGCCGTCGGGGCTGCCCGTCGCGGGACGTCGAAACGTGGGGCGAAACGTG"

bwt, err := New(inputSequence)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to example_test.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// Output: AACGTG
}

type BWTExtractTestCase struct {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be exposed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be. Explained above.

bwt/bwt.go Outdated
return searchRange.end - searchRange.start, nil
}

// Locate returns a list of offsets at which the begging
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: begging should be beginning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

bwt/wavelet.go Outdated
path bitvector
}

func NewWaveletTreeFromString(str string) waveletTree {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this supposed to be public? Needs Doc string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not. ty. Made it private.

@TwFlem
Copy link
Contributor Author

TwFlem commented Dec 21, 2023

It's not clear where a lot of the test cases are coming from and there should be more comments about where they're from, and what corner case their testing.

I added some comments about some edge cases that are being tested like with Jacobson's rank and the rsa_bitvector rank tests. I added a mention of how useful the the reconstruction test is for the WaveletTree. I also added a reconstruction test that pretty much guarantees that whatever we use for the suffix array in the future is well formed- otherwise we would never be able to reconstruct the string.

I added some more tests and more validation. So now we throw errors for empty and other invalid inputs. I fixed an edge case where a user could supply a string with a single alphabet character.

Otherwise, there aren't really too many edge cases to go around with the FM index BWT. It's all predicated on the fact that the LF search works. LF search is all predicated on the fact that the Wavelet Tree and other auxiliary DSAs work as expected. The other tests are just to meant the show that auxiliary pieces work as expected. The RSA and Bit vector test cases are verbose because of all the bit math involved. Those are there to give us peace of mind that it works as expected.

There will be some more BWT edge cases with the other improvements that I'll follow up with like the RLFM index (done) and the r-index (in progress). Those I can definitely call out. Those changes will be pretty small btw. Nothing like getting the basic BWT off the ground 😭

I'm with @carreter that it'd be good to wrap my head around the actual concepts but I'm also thinking we should find some outside folk more familiar with bwt to review this too if we can.

Sounds good! It might be better to emphasize the correctness over performance improvements. Reasonable performance improvements are already on the way 👀


Lets use banana as an example.

banana$ $banana
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can there be an example test that takes banana$ and returns the last column as a string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TimothyStiles updated

bwt/bwt.go Outdated
return bwt.firstColumnSkipList[i]
}
}
panic("figure out what to do here")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what should be done here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ty, updated this along with one hiding in the bitvector 1cc3755

"github.com/bebop/poly/bwt"
"golang.org/x/exp/slices"
)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should have an Example_basic test here showing how to use the package's basic functionality end-to-end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated 1cc3755

@TimothyStiles TimothyStiles merged commit 176dd5b into bebop:main Jan 18, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants